Abstract


We give a unique string representation, up to isomorphism, for initially connected
deterministic finite automata (ICDFA’s) with n states over an alphabet of k symbols.
We show how to generate all these strings for each n and k, and how their enumeration
provides an alternative way to obtain the exact number of ICDFA’s.


1 Motivation 


In symbolic manipulation environments for finite automata, it is important to have an 
adequate representation of automata and, dependent upon their use, several representations 
may be available. For example, for testing if two finite automata are isomorphic objects 
or for (random) generation of automata, the representation must be compact and somehow 
canonical. In the FAdo project [MR05a, fad] a canonical form is used to test if two minimal 
DFA’s are isomorphic (i.e., are the same up to renaming of states). In this paper we prove
the correctness of that representation and show how it can be used for the exact enumeration 
and generation of initially connected deterministic finite automata (ICDFA). 

The problem of enumeration of finite automata has been considered by several authors since
the early 1960s; see in particular Harrison [Har65], Robinson [Rob85], Harary and Palmer [HP67]
and Liskovets [Lis69] amongst many others. A survey may be found in Domaratzki et 
al. [DKS02]. More recently, several authors examined related problems. Domaratzki et 
al. [DKS02] studied the enumeration of distinct languages accepted by finite automata 
with n states; Nicaud [Nic99], Champarnaud and Paranthoën [CP05, Par04] and Bassino
and Nicaud [BN] analysed several aspects of the average behaviour of regular languages; 
Liskovets [Lis03] and Domaratzki [Dom04] gave (exact and asymptotic) enumerations of 
acyclic DFA’s and of finite languages. 

The paper is organised as follows. In the next section, we review some basic notions 
and introduce some notation. Section 3 describes a string representation for deterministic 
finite automata that is unique up to isomorphism for initially connected deterministic finite 
automata. Section 4 presents an efficient method to generate those strings. Section 5 shows 
how their enumeration provides an upper bound and the exact value for the number of 
ICDFA’s. Section 6 and Appendix A report some implementation issues and final remarks. 


2 Preliminaries 


We first recall some basic notions from automata theory and formal languages, which can be
found in standard books [HMU00]. An alphabet $\Sigma$ is a nonempty set of symbols. A string
over $\Sigma$ is a finite sequence of symbols of $\Sigma$. The empty string is denoted by $\epsilon$. The set $\Sigma^*$
is the set of all strings over $\Sigma$. A language $L$ is a subset of $\Sigma^*$. The density of a language
$L$ over $\Sigma$, $\rho_L(n)$, is the number of strings of length $n$ that are in $L$, i.e., $\rho_L(n) = |L \cap \Sigma^n|$.
A regular expression (r.e.) $\alpha$ over $\Sigma$ represents a language $L(\alpha) \subseteq \Sigma^*$ and is inductively
defined by: $\emptyset$, $\epsilon$ and $\sigma \in \Sigma$ are r.e., where $L(\emptyset) = \emptyset$, $L(\epsilon) = \{\epsilon\}$ and $L(\sigma) = \{\sigma\}$; if $\alpha_1$ and
$\alpha_2$ are r.e., $(\alpha_1 + \alpha_2)$, $(\alpha_1\alpha_2)$ and $\alpha_1^*$ are r.e., respectively with $L((\alpha_1 + \alpha_2)) = L(\alpha_1) \cup L(\alpha_2)$,
$L((\alpha_1\alpha_2)) = L(\alpha_1)L(\alpha_2)$ and $L(\alpha_1^*) = L(\alpha_1)^*$. In this paper, we will use regular expressions
to represent descriptions of finite automata. A deterministic finite automaton (DFA) $A$ is a
quintuple $(Q, \Sigma, \delta, q_0, F)$ where $Q$ is a finite set of states, $\Sigma$ is the alphabet, $\delta : Q \times \Sigma \to Q$
is the transition function, $q_0$ the initial state and $F \subseteq Q$ the set of final states. We assume
that the transition function is total, so we consider only complete DFA’s. The size of a DFA


is the number of its states, $|Q|$. Normally, we are not interested in the labels of the states
and we can represent them by an integer $0 \le i < |Q|$. The transition function $\delta$ extends
naturally to $\Sigma^*$: for all $q \in Q$, if $x = \epsilon$ then $\delta(q, \epsilon) = q$; if $x = y\sigma$ then $\delta(q, x) = \delta(\delta(q, y), \sigma)$.
A DFA is initially connected¹ (ICDFA) if for each state $q \in Q$ there exists a string $x \in \Sigma^*$
such that $\delta(q_0, x) = q$. Two DFA’s $A = (Q, \Sigma, \delta, q_0, F)$ and $A' = (Q', \Sigma, \delta', q'_0, F')$ are called
isomorphic (by states) if there exists a bijection $f : Q \to Q'$ such that $f(q_0) = q'_0$ and for
all $\sigma \in \Sigma$ and $q \in Q$, $f(\delta(q, \sigma)) = \delta'(f(q), \sigma)$. Furthermore, for all $q \in Q$, $q \in F$ if and only
if $f(q) \in F'$. The language accepted by a DFA $A$ is $L(A) = \{x \in \Sigma^* \mid \delta(q_0, x) \in F\}$. Two
DFA’s are equivalent if they accept the same language. Obviously, two isomorphic automata
are equivalent, but two non-isomorphic automata may be equivalent. A DFA $A$ is minimal
if there is no DFA $A'$ with fewer states equivalent to $A$. Trivially a minimal DFA is an
ICDFA. Minimal DFA’s are unique up to isomorphism. We are mainly concerned with the
representation of the transition function of DFA’s of size $n$ over an alphabet of $k$ symbols,
so we disregard the set of final states and we consider only a quadruple $(Q, \Sigma, \delta, q_0)$ called
the structure of an automaton and referred to as DFA$_\emptyset$. For each of our representations,
there will be $2^n$ DFA’s. We denote by ICDFA$_\emptyset$ the structure of an ICDFA. We consider
that any integer variable has always a nonnegative value (if not otherwise stated). Let
$[n]_0 = \{0, 1, \ldots, n\}$ and $[n] = \{1, \ldots, n\}$.
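As a small illustration of these definitions, the following Python sketch stores a complete DFA structure as a transition table and checks whether it is initially connected. The table layout (`delta[q][a]` with states and symbols numbered from 0) and the function names are assumptions made for this example only; they are not taken from the FAdo implementation.

```python
from collections import deque

def reachable_states(delta, q0=0):
    """Return the set of states reachable from q0.

    delta[q][a] is the target of state q on the a-th symbol; the
    table is assumed to describe a complete DFA structure.
    """
    seen = {q0}
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        for target in delta[q]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def is_initially_connected(delta, q0=0):
    """True if every state is reachable from the initial state q0."""
    return len(reachable_states(delta, q0)) == len(delta)

# Three states over two symbols, every state reachable from state 0.
connected = [[0, 1], [0, 2], [0, 2]]
# Two states where state 1 can never be reached from state 0.
disconnected = [[0, 0], [1, 1]]
```

A breadth-first search is used here only because it is the shortest way to express reachability; any graph traversal would do.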


3 Representations towards a normal form 


The method used to represent a DFA has a significant role in the amount of computer
work needed to manipulate that information, and can give an important insight about this
set of objects, both in its characterisation and enumeration. Let us disregard the set of final
states of a DFA. A naive representation of a DFA$_\emptyset$ can be obtained by the enumeration of
its states and for each state a list of its transitions for each symbol. For the DFA$_\emptyset$ in Fig. 1
we have:


[[A (a: A, b: B)], [B (a: A, b: E)], [C (a: B, b: E)],
[D (a: D, b: C)], [E (a: A, b: E)]]. (1)


Figure 1: A DFA with no final states marked 


Given a complete DFA$_\emptyset$ $(Q, \Sigma, \delta, q_0)$ with $|Q| = n$ and $|\Sigma| = k$ and considering a total
order over $\Sigma$, the representation can be simplified by omitting the alphabetic symbols. For
our example, we would have

[[A (A, B)], [B (A, E)], [C (B, E)], [D (D, C)], [E (A, E)]]. (2)

¹ Also called accessible.


The labels chosen for the states have a standard order (in the example, the alphabetic
order). We can simplify the representation a bit if we use that order to identify the states,
and because we are representing complete DFA$_\emptyset$’s we can drop the inner tuples as well. We
obtain

[0, 1, 0, 4, 1, 4, 3, 2, 0, 4]. (3)


Because this representation depends on the order in which we label the states, we have more than
one representation for each DFA$_\emptyset$. Can we have a canonical order for the set of the states?
Let the first state be the initial state $q_0$ of the automaton, the second state the first one to be
referred to (excepting $q_0$) by a transition from $q_0$, the third state the next referred to in transitions
from one of the first two states, and so on. For the DFA$_\emptyset$ in the example, this method
induces a unique order for the first three states (A, B, E), but the order of the remaining
states (C, D), which are not reachable from A, must be chosen arbitrarily. Two different
representations are thus admissible:


[0, 1, 0, 2, 0, 2, 3, 4, 1, 2] and [0, 1, 0, 2, 0, 2, 1, 2, 4, 3]. (4)
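The passage from the naive representation (1) to the flat lists (3) and (4) can be reproduced with a few lines of Python. This is only a sketch: the dictionary encoding of Fig. 1 and the helper name `flatten` are choices made for this example.

```python
# Transitions of the structure in Fig. 1, written as in representation (1).
fig1 = {
    "A": {"a": "A", "b": "B"},
    "B": {"a": "A", "b": "E"},
    "C": {"a": "B", "b": "E"},
    "D": {"a": "D", "b": "C"},
    "E": {"a": "A", "b": "E"},
}

def flatten(transitions, state_order, symbols):
    """List the targets state by state, using positions in state_order as labels."""
    index = {state: i for i, state in enumerate(state_order)}
    return [index[transitions[q][a]] for q in state_order for a in symbols]

alphabetic = flatten(fig1, "ABCDE", "ab")   # representation (3)
variant_1 = flatten(fig1, "ABEDC", "ab")    # first list in (4)
variant_2 = flatten(fig1, "ABECD", "ab")    # second list in (4)
```

The two variants differ only in how the unreachable states C and D are ordered, which is exactly the ambiguity described above.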


If we restrict this representation to ICDFA$_\emptyset$’s, then this representation is unique and defines
an order over the set of its states. In the example, the DFA$_\emptyset$ restricted to the set of states
{A, B, E} is represented by [0, 1, 0, 2, 0, 2]. Let $\Sigma = \{\sigma_i \mid i < k\}$, with $\sigma_0 < \sigma_1 < \cdots < \sigma_{k-1}$.


Given an ICDFA$_\emptyset$ $(Q, \Sigma, \delta, q_0)$ with $|Q| = n$, the representing string is of the form $[(S_i)_{i<kn}]$
with $S_i \in [n-1]_0$ and $S_i = \delta(\lfloor i/k \rfloor, \sigma_{i \bmod k})$. In Figure 2, we present an algorithm for obtaining
this string representation.


uniqueStr {
    S = []
    Ord(q_0) = 0
    i = 0; j = 0
    while i <= j:
        for l in [k-1]_0:
            if Ord(δ(Ord^{-1}(i), σ_l)) not defined then
                j = j + 1
                Ord(δ(Ord^{-1}(i), σ_l)) = j
            S = S + [Ord(δ(Ord^{-1}(i), σ_l))]
        i = i + 1
    return S
}


Figure 2: Obtaining the string representation of an ICDFAg. 
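A runnable counterpart of the procedure in Figure 2 might look as follows. It is a sketch under stated assumptions: the transition function is passed as a dictionary keyed by `(state, symbol_index)`, and the names `unique_str`, `order` and `by_number` are illustrative rather than taken from the authors' code.

```python
def unique_str(delta, q0, k):
    """Canonical string of the part of a complete DFA structure reachable from q0.

    delta maps (state, symbol_index) to a state; state labels may be
    any hashable values, symbols are numbered 0..k-1.
    """
    order = {q0: 0}     # plays the role of Ord: state -> canonical number
    by_number = [q0]    # plays the role of Ord^{-1}: canonical number -> state
    s = []
    i = 0
    while i < len(by_number):
        q = by_number[i]
        for a in range(k):
            target = delta[(q, a)]
            if target not in order:
                # First reference to this state: give it the next number.
                order[target] = len(by_number)
                by_number.append(target)
            s.append(order[target])
        i += 1
    return s

# The structure of Fig. 1 restricted to the states {A, B, E}.
abe = {
    ("A", 0): "A", ("A", 1): "B",
    ("B", 0): "A", ("B", 1): "E",
    ("E", 0): "A", ("E", 1): "E",
}
# The same structure with the states renamed.
renamed = {
    ("x", 0): "x", ("x", 1): "z",
    ("z", 0): "x", ("z", 1): "y",
    ("y", 0): "x", ("y", 1): "y",
}
```

Because states are numbered in the order in which they are first referenced, two isomorphic structures produce the same list, which is what makes the string usable as an isomorphism test.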


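Lemma 1. Let $[(S_i)_{i<kn}]$ be a representation of a complete ICDFA$_\emptyset$ $(Q, \Sigma, \delta, q_0)$ with $|Q| = n$
and $|\Sigma| = k$, then:

$$(\forall m > 1)(\forall i)\,\bigl(S_i = m \Rightarrow (\exists j < i)\; S_j = m - 1\bigr) \tag{R1}$$
$$(\forall m \in [n-1])\,(\exists j < km)\; S_j = m \tag{R2}$$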


Figure 3: R1 states that first references to each state (1, 2, ..., n-2, n-1) occur sequentially.


Figure 4: R2 ensures that before the appearance of the set of transitions from a given state
at least a reference to that state must appear in the string.


Proof. (of Lemma 1) The condition R1 establishes that a state label (greater than 1) can
only occur after the occurrence of its predecessors. This is a direct consequence of the way
we defined the representing string.

Suppose R2 does not hold; then there exists a state $m$ that does not occur in the
first $km$ symbols of the string (the first $m$ state descriptions). Because the automaton
is initially connected there must be a sequence of states $(m_i)_{i \le l}$ and symbols $(\sigma_i)_{i<l}$ such
that $m_0 = 0$, $m_l = m$ and $\delta(m_i, \sigma_i) = m_{i+1}$ for $i < l$; we may take this sequence without
repeated states. We must have $0 < m < m_{l-1}$
because $m$ appears in the $m_{l-1}$ description and we supposed no occurrences of $m$ in the first
$m$ state descriptions. There must exist $l' < l$ such that $m_{l'} < m < m_{l'+1}$, implying that
$m_{l'+1} \in \{S_i \mid i < km\}$. This contradicts R1 because we are supposing that $m \notin \{S_i \mid i < km\}$
and $m < m_{l'+1}$. Thus R2 holds. □


Note that the conditions R1 and R2 are independent. For $k = 2$ and $n = 3$, the string
[2, 1, 0, 0, 1, 0] satisfies R2 but not R1, and the opposite occurs for the string [0, 0, 1, 1, 0, 2].
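The two conditions are easy to test mechanically. The following sketch transcribes R1 and R2 directly into Python; the function names are illustrative, and the input is assumed to be a list of integers of length kn with values in 0..n-1.

```python
def satisfies_r1(s):
    """R1: a label m > 1 may appear only after m - 1 has appeared."""
    seen = set()
    for value in s:
        if value > 1 and (value - 1) not in seen:
            return False
        seen.add(value)
    return True

def satisfies_r2(s, n, k):
    """R2: each label m in 1..n-1 occurs among the first k*m symbols."""
    return all(m in s[:k * m] for m in range(1, n))

only_r2 = [2, 1, 0, 0, 1, 0]
only_r1 = [0, 0, 1, 1, 0, 2]
both = [0, 1, 0, 2, 0, 2]
```

With k = 2 and n = 3, `only_r2` fails R1 because it starts with 2 before any 1 has been seen, while `only_r1` fails R2 because the description of state 0 (its first two symbols) never mentions state 1.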


Lemma 2. Every string $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$ satisfying R1 and R2 represents a
complete ICDFA$_\emptyset$ with $n$ states over an alphabet of $k$ symbols.


Proof. Let $[(S_i)_{i<kn}]$ be a string in the referred conditions, and consider the associated
automaton $A$ using the string symbols as labels for the corresponding states. By its
construction, $A$ is a deterministic complete finite automaton structure. We only need to prove
that it is initially connected. Let $m$ be a state of the automaton.

A proof that $m$ is reachable from the initial state 0 can be done by induction on $m$.

If $m = 0$ there is nothing to prove. If $m = 1$ then, by R2, 1 must occur in the description
of state 0, making state 1 reachable from state 0.

Let us suppose that every state $m' < m$ is reachable from state 0 and prove that state $m$
is reachable too. By R2, $m$ occurs at least once before position $km$, say in position $km' + i$
with $m' < m$ and $i < k$. Then $\delta(m', \sigma_i) = m$. By induction hypothesis,
state $m'$ is reachable from state 0, thus state $m$ is reachable too and the automaton is initially
connected.

Now consider the string representation obtained for $A$, $[(S'_i)_{i<kn}]$. By Lemma 1 it satisfies
R1 and R2. It is easy to see that this representation is the same as $[(S_i)_{i<kn}]$. By R1,
$S'_0 = S_0$. Suppose that $(\forall i < j)(S'_i = S_i)$. Now we prove that $S'_j = S_j$. By R1, either
$S_j \in \{S_i \mid i < j\}$ or $S_j = \max\{S_i \mid i < j\} + 1$. In the first case, there exists $l < j$ such that
$S_j = S_l$ and, by induction hypothesis, $S'_l = S_l$, thus

$$\begin{aligned}
S'_j &= \delta(\lfloor j/k \rfloor, \sigma_{j \bmod k}) \\
     &= \delta(\lfloor l/k \rfloor, \sigma_{l \bmod k}) \\
     &= S'_l \\
     &= S_l = S_j.
\end{aligned}$$

Analogously, by R1, in the second case we have that
$S'_j = \max\{S'_i \mid i < j\} + 1 = S_j$.

□


Theorem 1. There is a one-to-one mapping between strings $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$
satisfying R1 and R2, and the non-isomorphic ICDFA$_\emptyset$’s with $n$ states, over an alphabet $\Sigma$
of size $k$.


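Proof. Let $(Q, \Sigma, \delta, q_0)$ and $(Q', \Sigma, \delta', q'_0)$ be two ICDFA$_\emptyset$’s and $[(S_i)_{i<kn}]$ and $[(S'_i)_{i<kn}]$
their representing strings. By Lemma 1, these strings satisfy R1 and R2. Suppose that
$f : Q \to Q'$ is an isomorphism between the ICDFA$_\emptyset$’s. Then $q_0 = 0$ and $f(q_0) = q'_0 = 0$.
Either $S_0 = \delta(q_0, \sigma_0) = q_0 = 0$ or $S_0 = \delta(q_0, \sigma_0) = 1$ (by R1).
i) If $S_0 = 0$ then $f(q_0) = f(\delta(q_0, \sigma_0)) = \delta'(q'_0, \sigma_0) = S'_0 = 0$, because $\delta(q_0, \sigma_0) = q_0$ implies
$\delta'(f(q_0), \sigma_0) = f(q_0)$.
ii) If $S_0 = 1$ then $f(1) = \delta'(q'_0, \sigma_0) = S'_0 \neq 0$, thus $S'_0 = 1$, again by R1.
Supposing that $(\forall i < j)(S_i = S'_i \wedge f(S_i) = S'_i)$ we need to prove that $S_j = S'_j \wedge f(S_j) = S'_j$.
Trivially we have $\{S_i \mid i < j\} = \{S'_i \mid i < j\}$. We know that $S_j = \delta(\lfloor j/k \rfloor, \sigma_{j \bmod k})$, and by
R2 there exists $l < j$ such that $\lfloor j/k \rfloor = S_l$, thus $f(\lfloor j/k \rfloor) = f(S_l) = S'_l = S_l = \lfloor j/k \rfloor$ by
induction hypothesis. We have
$$S'_j = \delta'(\lfloor j/k \rfloor, \sigma_{j \bmod k}) = \delta'(f(\lfloor j/k \rfloor), \sigma_{j \bmod k}) = f(\delta(\lfloor j/k \rfloor, \sigma_{j \bmod k})) = f(S_j).$$
By R1, either $S_j \in \{S_i \mid i < j\}$ or $S_j = \max\{S_i \mid i < j\} + 1$.
i) If $S_j \in \{S_i \mid i < j\}$ then there exists $l < j$ such that $S_j = S_l$ and $S_l = S'_l$. Then

$$\begin{aligned}
\delta(\lfloor j/k \rfloor, \sigma_{j \bmod k}) = \delta(\lfloor l/k \rfloor, \sigma_{l \bmod k})
  &\Rightarrow f(\delta(\lfloor j/k \rfloor, \sigma_{j \bmod k})) = f(\delta(\lfloor l/k \rfloor, \sigma_{l \bmod k})) \\
  &\Rightarrow \delta'(\lfloor j/k \rfloor, \sigma_{j \bmod k}) = \delta'(\lfloor l/k \rfloor, \sigma_{l \bmod k}).
\end{aligned}$$
Thus $S_j = S_l$ implies $S'_j = S'_l$, and so $S_j = S'_j$.

ii) If $S_j = \max\{S_i \mid i < j\} + 1$ then $S'_j \notin \{S'_i \mid i < j\}$, because if there existed an $l < j$
such that $S'_j = S'_l$ then, by the same reasoning as before, $S_j \in \{S_i \mid i < j\}$. Thus, by R1,
$S'_j = \max\{S'_i \mid i < j\} + 1 = S_j$.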

Conversely, by Lemma 2, we have that each string represents an ICDFA$_\emptyset$ up to a compatible
renaming of states, i.e., if two ICDFA$_\emptyset$’s are represented by the same string, that
representation defines an isomorphism between them. □


These string representations lead to a normal representation for ICDFA$_\emptyset$’s. For each of
them, if we add a sequence of final states, we obtain a normal form for ICDFA’s.


4 Generating automata 


Normal representations for ICDFA$_\emptyset$’s (as presented above) can be used as compact computer
representations for that kind of object, but even though rules R1 and R2 are quite simple,
it is not evident how to write an enumerative algorithm in an efficient way. In a string
representing an ICDFA$_\emptyset$ with $n$ states over an alphabet of $k$ symbols, $[(S_i)_{i<kn}]$, let $(f_j)_{0<j<n}$
be the sequence of indexes of the first occurrence of each state label $j$. That those indexes
exist is a direct consequence of the way the string is constructed. Now consider

$$\begin{aligned}
b_1 &= f_1 + 1; \\
b_j &= f_j - f_{j-1}, \quad \text{for } 2 \le j \le n-1; \\
b_n &= kn - f_{n-1}.
\end{aligned}$$


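[Diagram: the string split at the first-occurrence positions $f_1, f_2, \ldots, f_{j-1}, f_j, \ldots, f_{n-1}$ into consecutive blocks of lengths $b_1, b_2, \ldots, b_j, \ldots, b_n$.]

Note that
$$\sum_{l=1}^{j} b_l = f_j + 1, \quad \text{for } j \in [n-1]. \tag{8}$$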


It is easy to see that
1. Rule R1 simply states that
$$(\forall\, 2 \le j \le n-1)\,(b_j > 0). \tag{G1}$$
2. Rule R2 establishes that
$$(\forall m \in [n-1])\,(f_m < km). \tag{G2}$$


To generate all the automata, for each allowed sequence $(b_j)_{0<j<n}$ we can generate all the
remaining symbols $S_i$ (those with $i \notin \{f_j \mid 0 < j < n\}$) according to the following rules:

$$i < f_1 \Rightarrow S_i = 0; \tag{G3}$$
$$(\forall j \in [n-2])\,(f_j < i < f_{j+1} \Rightarrow S_i \in [j]_0); \tag{G4}$$
$$i > f_{n-1} \Rightarrow S_i \in [n-1]_0. \tag{G5}$$
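Rules G1 to G5 translate into a short generator. The sketch below is not the authors' program: it enumerates the first-occurrence positions $f_1 < \cdots < f_{n-1}$ with $f_m < km$ and then fills the remaining positions with every value the rules allow. The function names are hypothetical, and no claim is made that the strings come out in lexicographic order.

```python
from itertools import product

def first_occurrence_sequences(n, k):
    """Yield tuples (f_1, ..., f_{n-1}) that are increasing and satisfy f_m < k*m."""
    def extend(prefix):
        m = len(prefix) + 1          # index of the next f_m to choose
        if m == n:
            yield tuple(prefix)
            return
        start = prefix[-1] + 1 if prefix else 0
        for f in range(start, k * m):
            yield from extend(prefix + [f])
    yield from extend([])

def icdfa_strings(n, k):
    """Yield every list of length k*n that satisfies R1 and R2."""
    for fs in first_occurrence_sequences(n, k):
        choices, j = [], 0           # j = number of first occurrences passed
        for i in range(k * n):
            if j < n - 1 and i == fs[j]:
                j += 1
                choices.append([j])           # first occurrence of label j
            else:
                choices.append(range(j + 1))  # any label already seen (G3-G5)
        for s in product(*choices):
            yield list(s)

count_3_2 = sum(1 for _ in icdfa_strings(3, 2))
count_2_2 = sum(1 for _ in icdfa_strings(2, 2))
```

Since only admissible strings are ever built, nothing is generated and then discarded. For n = 3 and k = 2 the generator yields 216 strings, and for n = 2 and k = 2 it yields 12, in agreement with Appendix A.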


5 Enumeration of ICDFA’s 


In this section we obtain a formula $B_k(n)$ for the number of strings $[(S_i)_{i<kn}]$ representing
ICDFA$_\emptyset$’s with $n$ states over an alphabet of $k$ symbols. Although a formula for the number
of non-isomorphic ICDFA$_\emptyset$’s is already known, we think that our method is new.
Liskovets [Lis69] and, independently, Robinson [Rob85] gave for that number the formula
$A_k(n) = \frac{h_k(n)}{(n-1)!}$, where $h_k(1) = 1$ and for $n > 1$

$$h_k(n) = n^{kn} - \sum_{1 \le j < n} \binom{n-1}{j-1}\, n^{k(n-j)}\, h_k(j). \tag{9}$$

Note that $n^{kn}$ is the number of transition functions, from which we subtract the number
of them that have $n-1, n-2, \ldots, 1$ states not accessible from the initial state. And then,
we may divide by $(n-1)!$, as the names of the remaining states (except the initial) are
irrelevant. In contrast, the formula we will derive ($B_k(n)$) is a direct positive summation.
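For comparison with the values in Appendix A, the subtraction scheme just described can be evaluated directly. The sketch below assumes the recurrence $h_k(n) = n^{kn} - \sum_{1 \le j < n} \binom{n-1}{j-1} n^{k(n-j)} h_k(j)$ with $h_k(1) = 1$, which counts the labelled structures whose accessible part has exactly $j$ states and removes them; the function names are illustrative.

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def h(k, n):
    """Labelled initially connected structures with n states and k symbols."""
    if n == 1:
        return 1
    not_connected = sum(
        # choose the other j-1 accessible states, wire them up in h(k, j)
        # ways, and let the n-j inaccessible states point anywhere
        comb(n - 1, j - 1) * n ** (k * (n - j)) * h(k, j)
        for j in range(1, n)
    )
    return n ** (k * n) - not_connected

def a(k, n):
    """Non-isomorphic ICDFA structures: divide out renamings of non-initial states."""
    return h(k, n) // factorial(n - 1)
```

For example, `a(2, 3)` evaluates to 216 and `a(3, 3)` to 7965, matching the tabulated values.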

First, let us consider the set of strings $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$ and satisfying only
rule R1. The number of these strings gives an upper bound for $B_k(n)$. This set can be given
by $A_n \cap [n-1]_0^{kn}$, where for $c > 0$,

$$A_c = 0^* + \sum_{i=1}^{c-1} \left( 0^* \prod_{j=1}^{i} j\,(0 + \cdots + j)^* \right). \tag{10}$$


These languages belong to a family of languages $L_c$ presented by Moreira and Reis [MR05b]
that represent partitions of $[n]$ with no more than $c \ge 1$ parts, i.e.,

$$L_c = \sum_{i=1}^{c} \prod_{j=1}^{i} j\,(1 + \cdots + j)^*. \tag{11}$$


We have that $\rho_{A_c}(n) = \rho_{L_c}(n+1)$ and that $\rho_{L_c}(n) = \sum_{i=1}^{c} S(n, i)$, where $S(n, i)$ are Stirling
numbers of the second kind. So we get that the number of strings of length $kn$ that are in $A_n$
is $\rho_{A_n}(kn) = \sum_{i=1}^{n} S(kn+1, i)$. We have the proposition,


Proposition 1. For all $n, k \ge 1$, $B_k(n) \le \sum_{i=1}^{n} S(kn+1, i)$.

For $n = 3$ and $k = 2$, $B_2(3) \le 365$. For $k = 2$, Bassino and Nicaud [BN] presented a
better upper bound, namely that $B_2(n) \le n\,S(2n, n)$.
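Both bounds are easy to evaluate for small cases. The sketch below computes Stirling numbers of the second kind with the usual recurrence $S(n, i) = i\,S(n-1, i) + S(n-1, i-1)$; the function names are illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, i):
    """Number of partitions of an n-element set into exactly i parts."""
    if n == i:
        return 1
    if i == 0 or i > n:
        return 0
    return i * stirling2(n - 1, i) + stirling2(n - 1, i - 1)

def r1_upper_bound(n, k):
    """Bound of Proposition 1: strings of length k*n that satisfy R1 alone."""
    return sum(stirling2(k * n + 1, i) for i in range(1, n + 1))

bound_r1 = r1_upper_bound(3, 2)      # 1 + 63 + 301
bound_bn = 3 * stirling2(6, 3)       # n * S(2n, n) for n = 3
```

For n = 3 and k = 2 this gives 365 for the bound of Proposition 1 and 270 for the bound n S(2n, n), against the exact value 216.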
Now let us consider only the rule R2. This rule can be formulated as 


From this formula it is easy to see that the strings $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$ and satisfying
only rule R2 can be represented by the regular expression

$$\bigcap_{m=1}^{n-1} \sum_{j=0}^{km-1} (0 + \cdots + (n-1))^{j}\; m\; (0 + \cdots + (n-1))^{kn-j-1}, \tag{13}$$

where we extended the operators of regular expressions to intersection.

Now in order to simultaneously satisfy rules R1 and R2, in formula (13), the first
occurrence of $m - 1$ must precede that of $m$, for $2 \le m \le n-1$. These positions are
exactly the sequence $(f_j)_{0<j<n}$ defined in Section 4. Given these positions and considering
the corresponding sequence $(b_j)_{0<j<n}$ we obtain the regular expression:

$$\left( \prod_{j=1}^{n-1} (0 + \cdots + (j-1))^{b_j - 1}\; j \right) (0 + \cdots + (n-1))^{b_n - 1},$$


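and we must consider the possible values of $(b_j)_{0<j<n}$, constrained to G1 and G2:

$$\sum_{b_1=1}^{k} \; \sum_{b_2=1}^{2k-b_1} \cdots \sum_{b_{n-1}=1}^{k(n-1) - \sum_{l=1}^{n-2} b_l} \left( \prod_{j=1}^{n-1} (0 + \cdots + (j-1))^{b_j - 1}\; j \right) (0 + \cdots + (n-1))^{b_n - 1}.$$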


For $n = 3$ and $k = 2$ we have

$$(01 + 1(0+1))\,((0+1)2 + 2(0+1+2))\,(0+1+2)^2 + 12(0+1+2)^4,$$

and the number of these strings is $(1 + 2)((2 + 3)3^2) + 3^4 = 216$.
For each sequence $(b_j)_{0<j<n}$ the number of strings $[(S_i)_{i<kn}]$ with $S_i \in [n-1]_0$ and
satisfying R1 and R2 is
$$\prod_{j=1}^{n} j^{\,b_j - 1},$$
a direct consequence of rules G3, G4 and G5. And then we must take the sums over all $b_j$
constrained to rules G1 and G2.


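Theorem 2. We have

$$B_k(n) = \sum_{b_1=1}^{k} \; \sum_{b_2=1}^{2k-b_1} \; \sum_{b_3=1}^{3k-b_1-b_2} \cdots \sum_{b_{n-1}=1}^{k(n-1) - \sum_{l=1}^{n-2} b_l} \; \prod_{j=1}^{n} j^{\,b_j - 1}. \tag{15}$$

Proof. It is an immediate consequence of rules G1 to G5. □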
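The nested sums of Theorem 2 can be evaluated with a short recursion. In this sketch, `used` holds $b_1 + \cdots + b_{j-1}$, the upper limit of each sum is $kj$ minus that amount, and the last block length is taken as $b_n = kn + 1 - (b_1 + \cdots + b_{n-1})$, which follows from (8) and the definition of $b_n$. The function name `b` and its helper are illustrative.

```python
def b(k, n):
    """B_k(n): number of strings of length k*n satisfying R1 and R2."""
    def go(j, used, weight):
        # j: index of the next b_j; used: b_1 + ... + b_{j-1};
        # weight: product of l ** (b_l - 1) over the indices chosen so far
        if j == n:
            b_n = k * n + 1 - used
            return weight * n ** (b_n - 1)
        return sum(
            go(j + 1, used + bj, weight * j ** (bj - 1))
            for bj in range(1, k * j - used + 1)
        )
    return go(1, 0, 1)

with_final_states = 2 ** 3 * b(2, 3)   # every subset of the 3 states may be final
```

Here `b(2, 3)` evaluates to 216, so there are 1728 non-isomorphic ICDFA's with 3 states over 2 symbols once final states are taken into account.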


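The above formula can also be rewritten using the sequence $(f_j)_{0<j<n}$, and considering
$f_n = kn$:

$$B_k(n) = \sum_{f_1=0}^{k-1} \; \sum_{f_2=f_1+1}^{2k-1} \; \sum_{f_3=f_2+1}^{3k-1} \cdots \sum_{f_{n-1}=f_{n-2}+1}^{k(n-1)-1} \; \prod_{l=2}^{n} l^{\,f_l - f_{l-1} - 1}.$$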


Corollary 1. The number of non-isomorphic ICDFA’s with $n$ states over an alphabet of $k$
symbols is
$$2^n B_k(n). \tag{16}$$

Proof. By Theorems 1 and 2 and considering the possible sets of final states. □


6 Conclusion 


The method described in Section 4 was implemented in Python [pyt] and used to generate
all ICDFA$_\emptyset$’s for $k = 2$ and $n < 10$, and $k = 3$ and $n < 7$. The time complexity of the
program is linear in the number of automata, and it took about a week to generate all the
referred ICDFA$_\emptyset$’s on a PPC G4 at 1.5 GHz.

One of the advantages of this method is that only the allowed strings are computed, so it
is not a generate-and-test algorithm. Because automata are generated in lexicographic
order, it is also easy to generate them as needed for consumption by another algorithm.

The formula $B_k(n)$ was also implemented and its values were computed for $k = 1..10$
and $n = 1..9$. Those values are presented in Appendix A. The sequences $B_2(n)$ and $B_3(n)$
appear in Sloane [Slo03] as A082165 and A065756, respectively. If an ICDFA with $n$
states accepts a finite language then there exists a topological order of its states such that
$\delta(i, \sigma) > i$, for all $i < n-1$ and $\sigma \in \Sigma$. But the order we used for string representations
is not a topological order. So we cannot determine directly from the string whether the accepted
language is finite, as was done by Domaratzki [Dom04] only for finite languages. Although
the formula $B_k(n)$ is quite similar to the one obtained in [Dom04] for an upper bound of the
number of finite languages, the meanings of the parameters $(b_j)$ are not directly related.
A Experimental Results


In this appendix we present the number of non-isomorphic ICDFA's without final states for
n = 1..9 states and k = 2..10 alphabetic symbols.


k = 2
  n   number of ICDFA's without final states
  1   1
  2   12
  3   216
  4   5248
  5   160675
  6   5931540
  7   256182290
  8   12665445248
  9   705068085303

k = 3
  n   number of ICDFA's without final states
  1   1
  2   56
  3   7965
  4   2128064
  5   914929500
  6   576689214816
  7   500750172337212
  8   572879126392178688
  9   835007874759393878655

k = 4
  n   number of ICDFA's without final states
  1   1
  2   240
  3   243000
  4   642959360
  5   3508208993750
  6   34253071111894176
  7   544271118689873008532
  8   13147735690099619023732736
  9   458677874292647947600097994111

k = 5
  n   number of ICDFA's without final states
  1   1
  2   992
  3   6903873
  4   175483321344
  5   11826519415721875
  6   1744085190146957291232
  7   494949686355427145872161111
  8   246491144450280856073240885624832
  9   200977948941552280610264305518977871090

k = 6
  n   number of ICDFA's without final states
  1   1
  2   4032
  3   190505196
  4   46086910722048
  5   38056697263376203125
  6   84121943186006445713224896
  7   423117794749852189502006410905462
  8   4310798840913881378315033530121291563008
  9   81510780531114326278646228956855976801744959908

k = 7
  n   number of ICDFA's without final states
  1   1
  2   16256
  3   5192233245
  4   11921614605697024
  5   120315894541852283281250
  6   3976063029034767886935933510912
  7   353521348806151995743455800832981571314
  8   73484638707005629827978811367001966356732051456
  9   32134987099884609628834726023582411808822980002131697574

k = 8
  n   number of ICDFA's without final states
  1   1
  2   65280
  3   140764942800
  4   3065045074098257920
  5   377746484367585519367187500
  6   186463110898012043254861617993372672
  7   292790327511533355186380818285419369165134504
  8   1240517859367854140741786003068555614652944740664737792
  9   12533845162122187320986901745839566315023480777415952875118142242

k = 9
  n   number of ICDFA's without final states
  1   1
  2   261632
  3   3807455329593
  4   786050986901533097984
  5   1182694443740139221396759765625
  6   8717477417765526110669606920661061954048
  7   241663209893166029311235709449296848489007150038885
  8   20862781312540752296309668431262192459252081308963680368459776
  9   4868562054782101154240008904969374335289040629362192719160637468384235331

k = 10
  n   number of ICDFA's without final states
  1   1
  2   1047552
  3   102881965757076
  4   201378988990926052917248
  5   3698771376375809074323775654296875
  6   407056620031409364982690175796310640877007872
  7   199195425299637859859159104431333727959687905790340860554
  8   350350773589537416604934471527510136835511671254200548676664702271488
  9   1888096336032066333099268007451472025946469500517722087924581588200472709241234833